1 Introduction

Machine learning and the algorithms used for it have become more and more
complex in recent years. Especially the growth of Deep Learning architectures
has resulted in a large number of hyperparameters - such as the number of
hidden layers or the transfer function in a neural network - which have to be
tuned to achieve the best possible performance.

Since the result of a hyperparameter choice can only be evaluated by completely 
training the model, every single hyperparameter evaluation requires potentially 
huge compute costs. Depending on the task, such an evaluation can take from 
hours to weeks, for potentially hundreds of evaluations. 

There exist several methods to select the hyperparameters. The one most
commonly used is grid search, which evaluates all combinations of chosen values
for each parameter. Clearly, the number of evaluations grows exponentially with
the number of parameters. Additionally, grid search risks spending several
evaluations on combinations that differ only in the value of an insignificant
parameter.
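The exponential growth can be made concrete with a small sketch. The parameter names and values below are purely illustrative assumptions and are not taken from apsis.

```python
import itertools

# Hypothetical search grid; parameter names and values are illustrative only.
grid = {
    "step_rate": [0.001, 0.01, 0.1],
    "momentum": [0.0, 0.5, 0.9],
    "decay": [0.8, 0.9, 0.99],
    "hidden_units": [200, 400, 800],
}


def grid_candidates(grid):
    """Yield every combination of the chosen values as a dictionary."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(grid_candidates(grid))
print(len(candidates))  # 3 values in each of 4 dimensions give 3**4 = 81

# Suppose momentum turned out to be insignificant: the grid then contains only
# 27 distinct settings of the remaining parameters, each evaluated three times.
distinct_without_momentum = {
    tuple(value for name, value in c.items() if name != "momentum")
    for c in candidates
}
print(len(distinct_without_momentum))  # 27
```

Adding a fifth parameter with three values would already triple the number of full training runs.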

In comparison, a continuous evaluation of randomly chosen parameter values as
presented in [4] avoids the latter problem, and has the advantage of being easy
to scale and easy to parallelize.

In general, most projects use grid search or random search in combination with
a human expert who determines small areas of the hyperparameter space to be
automatically evaluated. This method obviously depends on expert knowledge
and is therefore very difficult to scale and not transferable.

Bayesian Optimization, which constructs a surrogate function using Gaussian
Processes, aims to rectify this and has been shown to deliver good results on the
hyperparameter optimization problem, for example in [?].

The apsis toolkit presented in this paper provides a flexible framework for
hyperparameter optimization and includes both random search and a Bayesian
optimizer. It is implemented in Python, and its architecture can be adapted
to any desired machine learning code. It can easily be used with common Python
ML frameworks such as scikit-learn [?]. It is published under the MIT License,
and other researchers are encouraged to check out the code, contribute or raise
any suggestions.

The code can be found at github.com/FrederikDiehl/apsis. 

In chapter 2 the concept of Bayesian Optimization is briefly introduced and
some theoretical results apsis is based on are presented. The next chapter covers
the flexible architecture of apsis, followed by a chapter on performance
evaluation. Finally, the conclusion lists possible further steps to take the
project to the next level.

We want to thank our supervisors Prof. Dr. Daniel Cremers and Dipl. Inf. Justin 
Bayer for their outstanding support and helpful contributions. 

2 Bayesian Optimization for Optimizing Hyperparameters

The general objective in hyperparameter optimization is to minimize a loss
function L or other performance measure of a machine learning algorithm A with
respect to the hyperparameter vector λ on withheld data.

In the following the true objective function will be called Ψ and the space of
hyperparameters will be called Λ.
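One way to write this objective down is sketched here, with λ denoting the hyperparameter vector, Λ the hyperparameter space and Ψ the true objective. The symbols A_λ (the algorithm A configured with λ) and D_valid (the withheld data) are notational assumptions of this sketch, not taken from the original text.

```latex
\lambda^{*} = \operatorname*{arg\,min}_{\lambda \in \Lambda} \Psi(\lambda),
\qquad
\Psi(\lambda) = L\left(A_{\lambda};\, \mathcal{D}_{\mathrm{valid}}\right)
```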



Fig. 1. Concept of sequential model-based optimization using a Gaussian Process
model and an Expected Improvement acquisition function, here shown for a 1D toy
minimization problem.

Since every evaluation of Ψ for a given λ incurs a high cost, one goal is to
keep the number of evaluations small. In Bayesian Optimization, a surrogate
model is built to approximate the objective function Ψ. Nearly all of the
Bayesian Optimization literature uses Gaussian Processes as the surrogate
model, which is constructed from the samples evaluated so far.

The next candidate hyperparameter values to be evaluated are obtained from the
surrogate model using a utility or acquisition function u.

Figure 1 illustrates these concepts. Following this procedure, the problem of
hyperparameter optimization turns into the problem of maximizing the
acquisition function u.

In total, the Bayesian Optimization procedure is shown in Algorithm 1 and is also
called sequential model-based optimization (SMBO) in the literature.
In this algorithm, a model M_t is used to approximate the complex response
function Ψ(λ). The history of points where Ψ has been evaluated is stored in
H. We run this algorithm for a fixed number of T steps. In each step
we first search for the extremum of the acquisition function u on the model
M_{t-1}. Afterwards, we execute the expensive evaluation of Ψ at the found
extremum λ* and add this evaluation to our history. Finally, we use the updated
history to update M_t. Figure 2 shows this iterative approach for a
one-dimensional objective function.


Algorithm 1 Bayesian Optimization Algorithm [4]. Notation adapted.

function BayOpt(Ψ, M_0, T, u)
    H ← ∅
    for all t ∈ {1..T} do
        λ* ← argmin_λ u(λ | M_{t-1})
        Evaluate Ψ(λ*)
        H ← H ∪ (λ*, Ψ(λ*))
        M_t ← refit surrogate model M_{t-1} to updated H
    end for
    return H
end function
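The loop of Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions: the surrogate is a crude nearest-neighbour stand-in rather than a Gaussian Process, the acquisition is a simple lower confidence bound rather than Expected Improvement, and the argmin is approximated over randomly proposed candidates. All function names are hypothetical and are not part of apsis.

```python
import random


def bay_opt(objective, fit_model, acquisition, propose, steps):
    """Sequential model-based optimization following Algorithm 1."""
    history = []                                  # H <- empty set
    model = fit_model(history)                    # M_0
    for _ in range(steps):                        # for all t in {1..T}
        # argmin of the acquisition on the previous model, approximated here
        # over a finite set of proposed candidates.
        best = min(propose(), key=lambda lam: acquisition(lam, model))
        value = objective(best)                   # the expensive evaluation
        history.append((best, value))             # add (lambda*, Psi(lambda*)) to H
        model = fit_model(history)                # refit the surrogate to H
    return history


# Toy stand-ins (assumptions of this sketch, not a Gaussian Process):
def fit_nearest_neighbour(history):
    return list(history)                          # the "model" is just the data


def lower_confidence_bound(lam, model, kappa=2.0):
    if not model:
        return 0.0                                # nothing known yet
    nearest, value = min(model, key=lambda item: abs(item[0] - lam))
    return value - kappa * abs(nearest - lam)     # low value or far away is good


rng = random.Random(0)
history = bay_opt(
    objective=lambda lam: (lam - 0.3) ** 2,
    fit_model=fit_nearest_neighbour,
    acquisition=lower_confidence_bound,
    propose=lambda: [rng.random() for _ in range(50)],
    steps=20,
)
best_lam, best_value = min(history, key=lambda item: item[1])
print(best_lam, best_value)
```

Passing the surrogate fit and the acquisition in as functions keeps the loop itself identical to the pseudocode, so a Gaussian Process and Expected Improvement could be swapped in without touching it.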


2.1 Acquisition Functions 

One of the most important design decisions is the acquisition function used to
interpret the model. Most related papers introduce the Probability of Improvement
function as a starting point and historical reference, but in general, Expected
Improvement works significantly better in practice. apsis implements both
functions and uses Expected Improvement as the default.


Probability of Improvement was first proposed by Kushner [?] and expresses
the simple idea of maximizing the probability of achieving any improvement by
choosing λ as the next point to sample:

$$u_{PI}(\lambda \mid M_t) = \Phi\left(\frac{\mu_{M_t}(\lambda) - f(\lambda^{*})}{\sigma_{M_t}(\lambda; \theta)}\right) \tag{1}$$

where Φ is the cumulative distribution function of the standard normal
distribution.

Fig. 2. Minimization of a one-dimensional function with Bayesian Optimization with
a GP and Expected Improvement acquisition function. First, only a few (randomly
chosen) samples are available. The acquisition function then balances the
trade-off between exploration and exploitation and has its maximum at the point
λ*_5. Ψ(λ*_5) is evaluated and the GP posterior is updated at t = 6.

By construction, Probability of Improvement favours exploitation of well-known
areas over exploration of less-known regions of the optimization space Λ.
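The preference for exploitation is easy to see numerically. The following sketch evaluates equation (1) for two hand-picked candidates; the posterior means and standard deviations are made-up illustrative numbers, and the handling of a zero standard deviation is an assumption of this sketch.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def probability_of_improvement(mean, std, best):
    """Equation (1): Phi((mu - f(lambda*)) / sigma) for one candidate."""
    if std <= 0.0:
        # Degenerate case (an assumption of this sketch): no uncertainty left.
        return 1.0 if mean > best else 0.0
    return normal_cdf((mean - best) / std)


# A candidate barely better than the incumbent but with tiny uncertainty ...
safe = probability_of_improvement(mean=1.01, std=0.01, best=1.0)
# ... is preferred over an uncertain candidate that could improve a lot.
risky = probability_of_improvement(mean=0.9, std=1.0, best=1.0)
print(safe, risky)
```

The criterion only asks whether an improvement occurs, not how large it is, which is why the nearly certain but tiny gain wins.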


Expected Improvement A better criterion is the expected value of the actual
improvement achieved when evaluating the objective at the next value of λ.


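$$u_{EI}(\lambda \mid M_t) = \int_{-\infty}^{\infty} \underbrace{\max(y^{*} - y, 0)}_{\text{value of improvement}} \cdot \underbrace{p_{M}(y \mid \lambda)}_{\text{probability of improvement}} \, dy \tag{2}$$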


Since this integral cannot be computed directly, an analytical solution is helpful.
For Gaussian Processes it has been shown [?] that the analytical solution of
basic EI is obtained as

$$u_{EI}(\lambda \mid M_t) = \sigma(\lambda) \cdot \big(z(\lambda) \cdot \Phi(z(\lambda)) + \phi(z(\lambda))\big) \tag{3}$$

using z(λ) defined as

$$z(\lambda) = \frac{f(\lambda^{*}) - \mu(\lambda)}{\sigma(\lambda)}$$


where μ(λ) and σ(λ) are the mean and standard deviation given by the
Gaussian Process used, and φ and Φ denote the density and the cumulative
distribution function of the standard normal distribution. Since apsis is
supposed to handle both maximization and minimization of the objective function,
both scenarios had to be incorporated into the acquisition function.
Additionally, a trade-off parameter ζ to control the exploitation/exploration
behaviour has been incorporated into u as suggested in [5]. This leads to the
following modified version of z(λ) that is used in apsis, where MAX is an integer
that is 1 when the objective function has to be maximized and 0 otherwise.

$$z(\lambda) = \frac{(-1)^{MAX} \cdot \big(f(\lambda^{*}) - \mu(\lambda) + \zeta\big)}{\sigma(\lambda)}$$
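As an illustration, the closed form of equation (3) together with the modified z can be sketched with the standard library only. The function and argument names are hypothetical and do not mirror the apsis code; the treatment of a zero standard deviation (returning the limit of the formula) is an assumption of this sketch.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_improvement(mean, std, best, maximize=False, zeta=0.0):
    """Closed-form EI of equation (3) using the modified z with MAX and zeta."""
    sign = -1.0 if maximize else 1.0               # (-1)^MAX
    if std <= 0.0:
        # Limit of the formula for vanishing uncertainty (assumption of this sketch).
        return max(sign * (best - mean + zeta), 0.0)
    z = sign * (best - mean + zeta) / std
    return std * (z * normal_cdf(z) + normal_pdf(z))


# Minimization: a lower predicted mean gives a larger expected improvement.
print(expected_improvement(mean=0.5, std=1.0, best=1.0))
print(expected_improvement(mean=1.5, std=1.0, best=1.0))
# Maximization flips the preference.
print(expected_improvement(mean=1.5, std=1.0, best=1.0, maximize=True))
```

Unlike Probability of Improvement, the value grows with the standard deviation even when the mean equals the incumbent, which is what rewards exploration.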


2.2 Expected Improvement Optimization 

One possibility for optimizing the acquisition function is to use gradient based 
optimization methods as they usually feature a better convergence speed than 
methods not relying on first order derivative information. 


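EI Gradient Derivation Hence the gradient of EI had to be derived analytically.
Applying the product rule to equation (3) one obtains

$$\nabla EI(\lambda) = \nabla\sigma(\lambda) \cdot \big(z(\lambda) \cdot \Phi(z) + \phi(z)\big) + \sigma(\lambda) \cdot \nabla\big(z(\lambda) \cdot \Phi(z) + \phi(z)\big) \tag{4}$$

$$\nabla EI(\lambda) = \frac{\nabla\sigma(\lambda) \cdot EI(\lambda)}{\sigma(\lambda)} + \sigma(\lambda) \cdot \big(\nabla z(\lambda)\,\Phi(z) + \underbrace{z(\lambda)\,\phi(z) \cdot \nabla z(\lambda) - z(\lambda)\,\phi(z) \cdot \nabla z(\lambda)}_{=0}\big) \tag{5}$$

In the above, φ'(x) = -x · φ(x) has been used, and the last two terms of the sum
cancel out, leading to the compact result

$$\nabla EI(\lambda) = \frac{\nabla\sigma(\lambda) \cdot EI(\lambda)}{\sigma(\lambda)} + \sigma(\lambda) \cdot \big(\nabla z(\lambda)\,\Phi(z)\big). \tag{6}$$

Finally, we compute the derivative of z(λ) using the product rule.

$$\nabla z(\lambda) = (-1)^{MAX} \cdot \frac{-\nabla\mu(\lambda)}{\sigma(\lambda)} - \frac{(-1)^{MAX}}{\sigma^{2}(\lambda)} \cdot \big(f(\lambda^{*}) - \mu(\lambda) + \zeta\big) \cdot \nabla\sigma(\lambda) \tag{7}$$

$$\nabla z(\lambda) = \frac{(-1)^{MAX} \cdot \big(-\nabla\mu(\lambda)\big)}{\sigma(\lambda)} - \frac{z(\lambda) \cdot \nabla\sigma(\lambda)}{\sigma(\lambda)} \tag{8}$$

It is more convenient for the implementation to have ∇σ²(λ) instead of ∇σ(λ),
since the GP framework used in apsis returns both the mean and variance
gradients. Fortunately, using the chain rule, one can derive

$$\nabla\sigma(\lambda) = \frac{\nabla\sigma^{2}(\lambda)}{2\sigma(\lambda)} \tag{9}$$

Using all of the above and inserting it into (6), the full EI gradient is

$$\nabla EI(\lambda) = \frac{\nabla\sigma^{2}(\lambda) \cdot EI(\lambda)}{2\sigma^{2}(\lambda)} - (-1)^{MAX} \cdot \nabla\mu(\lambda) \cdot \Phi(z(\lambda)) - \frac{\nabla\sigma^{2}(\lambda) \cdot z(\lambda) \cdot \Phi(z(\lambda))}{2\sigma(\lambda)}$$

which is the formula implemented in apsis.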
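A gradient of this kind can be sanity-checked against central finite differences. The sketch below does so for a made-up one-dimensional posterior; the mean and variance functions are illustrative assumptions, not a Gaussian Process, and the analytic gradient is written as the sum of the EI term, the mean-gradient term and the variance-gradient term obtained by inserting (8) and (9) into (6).

```python
import math


def pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Toy posterior (an assumption of this sketch): smooth mean and variance in 1D.
def mean(lam):
    return math.sin(lam)


def mean_grad(lam):
    return math.cos(lam)


def var(lam):
    return 1.0 + 0.5 * math.cos(2.0 * lam)


def var_grad(lam):
    return -math.sin(2.0 * lam)


def ei_terms(lam, best, maximize, zeta):
    sign = -1.0 if maximize else 1.0               # (-1)^MAX
    sigma = math.sqrt(var(lam))
    z = sign * (best - mean(lam) + zeta) / sigma
    return sigma * (z * cdf(z) + pdf(z)), z, sigma, sign


def ei_grad(lam, best, maximize=False, zeta=0.0):
    """Analytic gradient expressed with the variance gradient."""
    ei, z, sigma, sign = ei_terms(lam, best, maximize, zeta)
    return (var_grad(lam) * ei / (2.0 * sigma ** 2)
            - sign * mean_grad(lam) * cdf(z)
            - var_grad(lam) * z * cdf(z) / (2.0 * sigma))


def ei_grad_numeric(lam, best, maximize=False, zeta=0.0, h=1e-6):
    """Central finite difference of EI, used only as a reference."""
    upper = ei_terms(lam + h, best, maximize, zeta)[0]
    lower = ei_terms(lam - h, best, maximize, zeta)[0]
    return (upper - lower) / (2.0 * h)


for lam in (0.3, 1.1, 2.5):
    for maximize in (False, True):
        analytic = ei_grad(lam, best=0.2, maximize=maximize, zeta=0.05)
        numeric = ei_grad_numeric(lam, best=0.2, maximize=maximize, zeta=0.05)
        print(lam, maximize, analytic, numeric)
```

Agreement for both values of MAX is a quick way to catch sign errors in the mean-gradient term.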


Available Optimization Methods The following gradient-based and gradient-free
optimization methods are available in apsis.

- Quasi-Newton optimization using inverse BFGS [20, p. 72]
- Limited Memory BFGS with bounds algorithm (default) [6]
- Nelder-Mead method [12]
- Powell method [?]
- Conjugate Gradient method [?]
- inexact/truncated Newton method using Conjugate Gradient to approximately
  solve the Newton equation [20, p. 62] [?]
- random search

Except for random search, the implementations of these methods are taken from
the SciPy project [8]. In order to provide these methods with a promising
starting point λ_0, a random search maximization is always executed first, and
the best resulting λ is used as λ_0 for any of the further optimization methods.


The BFGS method is the strongest of these methods and offers the best
convergence speed since it uses gradient information and approximates the
second-order information. Similar to the original Newton method, it computes the
next optimization step s_k by solving the Newton equation

$$\nabla^{2} f(\lambda_k) \cdot s_k = -\nabla f(\lambda_k) \tag{10}$$

but as a so-called Quasi-Newton method it approximates the Hessian ∇²f by a
matrix H. To this end it keeps the Quasi-Newton equation (11) satisfied in every
step to make sure that H provides a good enough approximation of the Hessian.

$$H_{k+1} \cdot (\lambda_{k+1} - \lambda_k) = \nabla f(\lambda_{k+1}) - \nabla f(\lambda_k) \tag{11}$$

Since equation (10) can be solved for s_k by inverting the Hessian, the inverse
BFGS method used in apsis directly approximates the inverse of the Hessian
using an iterative algorithm to speed up optimization.
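To make the role of the Quasi-Newton equation concrete, here is a sketch of a single textbook inverse BFGS update in plain Python. It is not the SciPy implementation that apsis uses, and all names are hypothetical. For the inverse approximation B = H^{-1}, the Quasi-Newton equation reads B_{k+1} · (∇f(λ_{k+1}) - ∇f(λ_k)) = λ_{k+1} - λ_k, which the last lines verify.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse_bfgs_update(inv_hessian, step, grad_change, eps=1e-12):
    """One inverse BFGS update; returns None if the curvature term vanishes."""
    n = len(step)
    curvature = sum(y * s for y, s in zip(grad_change, step))   # y^T s
    if abs(curvature) < eps:
        return None                    # the update would divide by (almost) zero
    rho = 1.0 / curvature
    # left = I - rho * s y^T, right = I - rho * y s^T (the transpose of left)
    left = [[(1.0 if i == j else 0.0) - rho * step[i] * grad_change[j]
             for j in range(n)] for i in range(n)]
    right = [[left[j][i] for j in range(n)] for i in range(n)]
    updated = matmul(matmul(left, inv_hessian), right)
    for i in range(n):
        for j in range(n):
            updated[i][j] += rho * step[i] * step[j]
    return updated


identity = [[1.0, 0.0], [0.0, 1.0]]
step = [1.0, 2.0]            # lambda_{k+1} - lambda_k
grad_change = [0.5, 1.5]     # grad f(lambda_{k+1}) - grad f(lambda_k)
updated = inverse_bfgs_update(identity, step, grad_change)
# Applying the updated inverse approximation to the gradient change
# recovers the step, i.e. the Quasi-Newton equation holds.
mapped = [sum(updated[i][j] * grad_change[j] for j in range(2)) for i in range(2)]
print(mapped)
```

The guarded division by the curvature term y^T s is the point at which such an update can fail numerically.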

It will, however, not always converge since the algorithm involves a division by
a term that can become 0 [20, p. 72]. In that case the algorithm stops
gracefully with an error and apsis relies on the random search result. Since the
optimization space is bounded in real settings, apsis by default uses an adapted
version of BFGS, the L-BFGS-B algorithm [6]. In contrast to ordinary BFGS it
does not store the full n × n Hessian but only a few vectors that implicitly
represent the Hessian approximation H. Furthermore, it has been extended by
Byrd et al. [6] to respect simple bound constraints for each variable,
which is sufficient for the bounds used in apsis.


3 Architecture 

3.1 General Architectural Overview 

The architecture of apsis is designed to be interoperable with any Python
machine learning framework or self-implemented algorithm. apsis features an
abstraction layer for the underlying optimization framework as depicted in
Figure 3. Every optimizer adheres to the abstract base class Optimizer. The
optimizer uses three important model classes to control optimization. Candidate
is used to represent a specific hyperparameter vector λ. Experiments are used to
define a hyperparameter optimization experiment. An Experiment holds a list of
ParamDef objects that define the nature of each specific hyperparameter. The
external program interacts with apsis through a set of Assistants. They help to
initialize the internal structure and models of apsis, provide convenient access
to optimization results, and can optionally store and plot them or compare
several experiments. Additionally, the external program has to provide some
integration code between its machine learning algorithm and the apsis
interface. Since the external program could be multi-threaded or clustered, apsis
refers to the processes running the machine learning algorithms as Workers.






Fig. 3. General architectural overview of apsis. The diagram separates the
external interaction from the apsis internals: Workers run the actual algorithm
for which the parameters are optimized, and Assistants offer methods to
initialize experiments and help with result storage and plots.


On the implementation level, apsis is made up of six packages reflecting this
architecture. Table 1 lists all of them and briefly describes their purpose.


3.2 Model Objects 

Candidate represents a specific hyperparameter vector λ and optionally stores
the result of the objective function Ψ(λ) achieved under this λ if it has
already been evaluated. Additionally, it can store the cost incurred for the
evaluation and any other meta information used by the worker; e.g. the worker
could keep track of the classifier's weights in each step and store them there.
The actual vector λ is stored as a dictionary such that every parameter
dimension λ_i is named.


Experiment stores information about the nature of the parameters to be
optimized and defines whether the problem is one of minimization or
maximization. It keeps track of successfully evaluated, currently evaluated and
to-be-evaluated Candidates. It stores the best Candidate and optionally names
the experiment. Additionally, it provides methods to check semantically whether
candidates are valid for this experiment, and can convert itself to a CSV format
for result storage.


Package | Description | Contents
assistants | Contains lab and experiment assistants for apsis' external interface. | experiment_assistant.py, lab_assistant.py
optimizers | Contains all available optimizers and the abstract base class. Contains a subpackage for Bayesian Optimization. | optimizers.py, random_search.py, bayesian_optimization.py, bayesian/acquisition_functions.py
models | Contains the model classes. | candidate.py, experiment.py, parameter_definition.py
tests | Contains unit tests for the complete apsis code. | test_assistants/..., test_models/..., test_optimizers/..., test_utilities/...
utilities | Utility functions used across all packages. | benchmark_functions.py, file_utils.py, import_utils.py, logging_utils.py, optimizer_utils.py, plot_utils.py, randomization.py
demos | Some demos intended for learning how to use apsis by example. | demo_MNIST.py, demo_MNIST_MCMC.py

Table 1. List of apsis packages with their purpose and contents.

ParamDef is the most general superclass used to define the nature of one
hyperparameter dimension. It makes no assumption on the nature of the stored
parameter. NominalParamDef defines non-comparable, unordered parameters and
stores all available values in a list. OrdinalParamDef represents ordinal
parameters where the order is maintained by the position in the list of values.
As ordinal parameters can be compared, they adhere to the ComparableParamDef
interface and provide a compare_values function with semantics similar to the
Python built-in __cmp__ function. Furthermore, apsis comes with special
support for numeric parameters, represented by the NumericParamDef class.
Numeric parameter definitions are comparable and are represented internally by a
warping into the [0,1] domain. Hence they need to be given an inwards and an
outwards warping function. The warped parameters are then compared and treated
like ordinary floats between 0 and 1. To ease initialization of
numeric parameters, MinMaxNumericParamDef automatically provides a warping
assuming a uniform distribution of the parameter between the given minimal and
maximal value. AsymptoticNumericParamDef provides a parameter definition
for a parameter which is expected to be close to one value. For example,
learning rates can be estimated as being close to 0. This leads to significantly
better results during optimization, as long as some expert knowledge is
available. Figure 4 depicts the structure of the available parameter types and
their inheritance relationships.
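The idea of warping numeric parameters into [0,1] and of comparing ordinal values by list position can be illustrated with two small stand-in classes. These classes and their method names are hypothetical and written only for this example; they do not reproduce the apsis classes or their signatures.

```python
class MinMaxWarp:
    """Illustrative stand-in for a min/max numeric parameter definition."""

    def __init__(self, lower, upper):
        if not lower < upper:
            raise ValueError("lower must be smaller than upper")
        self.lower = lower
        self.upper = upper

    def warp_in(self, value):
        """Map a value from [lower, upper] into the [0, 1] domain."""
        return (value - self.lower) / (self.upper - self.lower)

    def warp_out(self, warped):
        """Map a value from [0, 1] back to [lower, upper]."""
        return self.lower + warped * (self.upper - self.lower)


class OrdinalValues:
    """Illustrative stand-in for an ordinal parameter definition."""

    def __init__(self, values):
        self.values = list(values)

    def compare_values(self, one, two):
        """Negative, zero or positive, depending on the list positions."""
        return self.values.index(one) - self.values.index(two)


hidden_units = MinMaxWarp(10, 30)
print(hidden_units.warp_in(20))     # 0.5
print(hidden_units.warp_out(0.5))   # 20.0

sizes = OrdinalValues(["small", "medium", "large"])
print(sizes.compare_values("small", "large") < 0)   # True
```

An optimizer that only ever sees the warped values can treat every numeric dimension alike, whatever its original range.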
















Fig. 4. Overview of ParamDef model classes and their relations. 


3.3 Optimization Cores 

The abstract base class Optimizer defines the common interface of any
optimization algorithm provided in apsis. Central is the method
get_next_candidates. Based upon information stored in the given experiment
instance, the method provides its user with one or several promising candidates
to evaluate next. On construction, the optimizer can receive a dictionary of
optimizer-specific parameters to define optimization-related hyperparameters.
Note that these parameters are not the ones that are subject to optimization but
are parameters that govern the optimization behaviour, such as which acquisition
function or kernel is used in Bayesian Optimization. apsis provides two
different optimization cores: a simple random-search-based optimizer called
RandomSearch and the SimpleBayesianOptimizer.


RandomSearch Implements a very simple random search optimizer. For
parameters of type NumericParamDef a uniform random variable between 0 and 1
is generated to select a value in warped space for each parameter. For parameters
of type NominalParamDef a value is drawn uniformly at random from the list of
allowed values. All random numbers are generated using the numpy [7] random
package.
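The sampling scheme just described can be sketched as follows. To stay self-contained the sketch uses Python's standard random module instead of numpy, plain dictionaries instead of the ParamDef classes, and a min/max warping for numeric parameters; all names and values are illustrative assumptions.

```python
import random

# Hypothetical parameter definitions; plain dictionaries stand in for the
# ParamDef classes, and numeric parameters use a min/max warping.
param_defs = {
    "step_rate": {"kind": "numeric", "bounds": (0.0, 1.0)},
    "hidden_units": {"kind": "numeric", "bounds": (100.0, 1000.0)},
    "transfer": {"kind": "nominal", "values": ["sigmoid", "tanh", "rectifier"]},
}


def sample_candidate(param_defs, rng):
    """Draw one random candidate as a dictionary of named parameter values."""
    candidate = {}
    for name, definition in param_defs.items():
        if definition["kind"] == "numeric":
            warped = rng.random()                        # uniform in [0, 1)
            lower, upper = definition["bounds"]
            candidate[name] = lower + warped * (upper - lower)   # warp out
        elif definition["kind"] == "nominal":
            candidate[name] = rng.choice(definition["values"])
        else:
            raise ValueError("unknown parameter kind: %r" % definition["kind"])
    return candidate


rng = random.Random(42)
candidates = [sample_candidate(param_defs, rng) for _ in range(5)]
for candidate in candidates:
    print(candidate)
```

Because every draw is independent of all earlier results, any number of such candidates can be generated and evaluated in parallel.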


SimpleBayesianOptimizer The SimpleBayesianOptimizer works according
to the theory described in chapter 2. It is called simple since it currently
works with only one worker at a time (though a multi-worker variant is
planned). It uses Gaussian Processes and their kernels provided by the GPy
framework [?]. For the implemented acquisition functions and the methods
used for their optimization, see sections 2.1 and 2.2.

3.4 Experiment Assistants


The BasicExperimentAssistant provides the simplest interface between apsis
and the outside world. It administers at most one experiment at a time.
An experiment needs to be initialized by at least specifying a name to
identify it, the Optimizer to be used and a list containing one ParamDef object
for each hyperparameter to be optimized. It holds and manages the Optimizer and
Experiment instances and provides an abstraction layer to their interface.



Fig. 5. The role of experiment assistants in apsis. 


The experiment assistants provide the two methods get_next_candidate and
update, which should be called by the external program when it is ready to
evaluate a new candidate or wants to notify apsis about work on a Candidate.
The Candidate does not necessarily need to be one that was provided by
apsis but can be any Candidate that adheres to the parameter space.
Furthermore, BasicExperimentAssistant takes care of storing the results to CSV
files while running, so that all information is available after termination or
abortion of an experiment run. The behaviour of the CSV writing can be controlled
upon initialization of the experiment assistant. As an extension, apsis provides
PrettyExperimentAssistant, which can additionally create plots of the
experiment. Chapter 4 shows some of these plots.

3.5 Lab Assistants



Fig. 6. Using lab assistants to manage and compare several experiments at once. 


BasicLabAssistant deals with the requirement to run and control several
experiments at once. It holds a dictionary of experiment assistants and provides
a common interface to all of them following the same semantics as the experiment
assistants themselves. The advantage of using lab assistants over
controlling several experiments manually at the same time is that the results
can be stored together and compared. BasicLabAssistant provides only that
function, while PrettyLabAssistant adds comparative plots for the comparison
of multiple optimizers.

4 Evaluation 

This section evaluates apsis on several benchmarks and one real-world
example. All experiments are evaluated using cross validation and 10 initial
random samples that are shared between all optimizers in an experiment to ensure
comparability.

4.1 Evaluation on Branin Hoo Benchmark Function 

Some recent papers on Bayesian Optimization for machine learning publish
evaluations on the Branin Hoo optimization function [?]. The Branin Hoo function

$$f_{Branin}(x, y) = a \cdot (y - b \cdot x^{2} + c \cdot x - r)^{2} + s \cdot (1 - t) \cdot \cos(x) + s$$

using values proposed in [?] is defined as

$$f_{Branin}(x, y) = \left(y - \frac{5.1}{4\pi^{2}} \cdot x^{2} + \frac{5}{\pi} \cdot x - 6\right)^{2} + 10 \cdot \left(1 - \frac{1}{8\pi}\right) \cdot \cos(x) + 10.$$
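For reference, the function is short enough to write down directly. The sketch below uses the general form given above with the commonly used constants as default arguments; the function name and signature are illustrative and not taken from the apsis benchmark module.

```python
import math


def branin(x, y, a=1.0, b=5.1 / (4.0 * math.pi ** 2), c=5.0 / math.pi,
           r=6.0, s=10.0, t=1.0 / (8.0 * math.pi)):
    """Branin Hoo function in its general form with the usual constants."""
    return a * (y - b * x ** 2 + c * x - r) ** 2 + s * (1.0 - t) * math.cos(x) + s


# With these constants the squared term vanishes at (pi, 2.275), one of the
# three global minima, leaving s * t = 10 / (8 * pi), roughly 0.3979.
print(branin(math.pi, 2.275))
print(branin(-math.pi, 12.275))
print(branin(3.0 * math.pi, 2.475))
```

All three printed values coincide up to rounding, since the function has three global minima of equal depth.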

Contrary to our expectations, Bayesian Optimization was not able to outperform
random search on Branin Hoo. Still, its results are much more stable, and the
Bayesian optimizer samples only close to the optimum.



Fig. 7. Comparison of Bayesian Optimization vs. random search optimization on
the Branin Hoo function. The left plot shows the best result at every step. Here,
random search clearly outperforms Bayesian Optimization. The right plot
additionally shows each function evaluation as a dot. Here, it is apparent that
Bayesian Optimization behaves much more stably and does not evaluate as many
non-promising candidates as random search.


4.2 Evaluation on Multidimensional Artificial Noise Function 

In theory, the advantage of an intelligent optimizer over random search should be
larger on less noisy functions than on very noisy ones. A very noisy function has
a tremendous number of local extrema, making it hard or even impossible for
Bayesian Optimization methods to outperform random search. To investigate this
proposition, an artificial multidimensional noise function has been implemented
in apsis as shown in Figure 8.

Using this noise function, one can generate multi-dimensional noise with
varying smoothness. The construction process first constructs an n-dimensional
grid of random points, which remains constant under varying smoothness.
A point is evaluated by averaging the randomly generated points, weighted by
a Gaussian with zero mean and varying variance.¹ This variance determines the
final smoothness. A one-dimensional example of generated functions for differing
variances is shown in Figure 8.

¹ Actually, only the closest few points are evaluated to increase performance.

Fig. 8. Plot of the artificial noise function used as an optimization benchmark
in apsis. It is generated using a grid of random values smoothed by a Gaussian of
varying variance.

Fig. 9. Plot of the end result after 20 optimization steps on a 3D artificial
noise problem depending on the smoothing used. Values to the right are for
smoother functions. A lower result is better.

The result can be seen in Figure 9. As expected, Bayesian Optimization
outperforms random search for smoother functions, while achieving rough parity
on rough functions.
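The construction of the noise function can be illustrated in one dimension. This is a sketch under assumptions: the grid lies on [0, 1], all grid points are averaged rather than only the closest few, and the function names are hypothetical rather than those of the apsis benchmark module. The total variation is used here merely as a rough measure of how wiggly the result is.

```python
import math
import random


def make_noise_function(grid_values, variance):
    """Build a 1D noise function from fixed random values on a grid over [0, 1]."""
    n = len(grid_values)
    positions = [i / (n - 1) for i in range(n)]

    def noise(x):
        # Gaussian weights centred on the query point; very small variances
        # would need care against underflow, which this sketch ignores.
        weights = [math.exp(-((x - p) ** 2) / (2.0 * variance)) for p in positions]
        return sum(w * v for w, v in zip(weights, grid_values)) / sum(weights)

    return noise


def total_variation(func, samples=200):
    """Sum of absolute differences along [0, 1] as a roughness measure."""
    values = [func(i / samples) for i in range(samples + 1)]
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


rng = random.Random(1)
grid_values = [rng.random() for _ in range(50)]   # stays fixed for all variances
rough = make_noise_function(grid_values, variance=1e-4)
smooth = make_noise_function(grid_values, variance=5e-2)
print(total_variation(rough), total_variation(smooth))
```

Both functions share the same random grid, so only the variance of the weighting Gaussian changes the smoothness.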


4.3 Evaluation on Neural Network on MNIST 

To evaluate the hyperparameter optimization on a real-world problem, we used
it to optimize a neural network on the MNIST dataset [?]. We used Breze [9] as
a neural network library² in Python.

The network is a simple feed-forward neural network with 784 input neurons, 
800 hidden neurons and 10 output neurons. It uses sigmoid units in the hidden 
layers, and softmax as output. We learn over 100 epochs. These parameters stay 
fixed throughout the optimization. 

To train the neural network weights, we use a backpropagation algorithm.
Its parameters - step_rate, momentum and decay - are optimized, as is c_wd,
a weight penalty term. This results in a four-dimensional hyperparameter
optimization problem.

We ran all neural network experiments with five-fold cross validation. Even so,
total evaluation time ran close to 24 hours on an Nvidia Quadro K2100M.
Figure 10 shows the performance of the optimizers for each step. As can be seen,
Bayesian Optimization, after the first ten shared steps, rapidly and
substantially improves the performance of the neural network. It is also
significantly more stable than the random search optimizer it is compared with.

However, the optimization above uses no previous knowledge of the problem. 
In an attempt to investigate the influence of such previous knowledge, we then 
set the parameter definition for the step_rate to assume it to be close to 0, and 
the decay to be close to 1. This is assumed to be knowledge easily obtainable 
from any neural network tutorial. 

The effects of this can be seen in Figure 11 and are dramatic. First of all, even
random search performs significantly better than before, reaching a similar value
as the uninformed Bayesian Optimization. Bayesian Optimization profits, too,
and decreases the mean error by about half.


² Breze uses Theano [?] in the background.

Fig. 10. Comparison of random search and Bayesian Optimization in the context of
a neural network. Each point represents one parameter evaluation of the respective
algorithm. The line represents the mean result of the algorithm at the corresponding
step including the boundaries of the 75% confidence interval.



Fig. 11. Comparison of random search and Bayesian Optimization in the context of 
a neural network. This optimization uses additional knowledge in that step_rate is 
assumed to be close to 0 and decay to be close to 1. 













































5 Conclusion 

With apsis we presented a flexible open source framework for the optimization of
hyperparameters in machine learning algorithms. It implements most of the state
of the art in hyperparameter optimization from current research and is open for
further extension.

It is our hope that apsis will continue to be expanded, and will be used
extensively in academia.

The performance evaluation shows that Bayesian Optimization is
significantly better than random search on a real-world machine learning problem.
Furthermore, the need for efficient parameter optimization already arises in
an experiment with only fifteen minutes of computation time per evaluation.
Such settings will be met by most practical machine learning
problems. Including very simple prior knowledge of the algorithm leads to another
significant performance improvement.

There are several areas in which apsis can be further improved. We would
like to implement support for multiple concurrent workers, allowing us to
optimize on clusters. Adding a REST web interface would allow an easy
integration of apsis into arbitrary languages and environments.

There also remain possible improvements to the implemented Bayesian optimizer
itself. [19] propose using early validation to abort the evaluation of
unpromising hyperparameters. [?] propose learning input warpings, which would
further reduce the reliance on previous knowledge. [16] use a cost function to
minimize total computing time instead of the number of evaluations. The problem
generally known as a tree-structured configuration space, as pointed out by [4],
could be tackled, allowing us to mark parameters as unused for a certain
evaluation. Student-t processes might be investigated as a replacement for
Gaussian processes. Lastly, adding efficient support for nominal parameters in
Bayesian Optimization is important for good performance, though since no
publication on that topic exists so far, this might be one of the biggest
barriers to tackle.